Fall back to control connection when host pools are empty #722

Draft
dkropachev wants to merge 1 commit into scylladb:master from dkropachev:fix/control-connection-fallback-720

Conversation

@dkropachev (Collaborator) commented Feb 23, 2026

Summary

  • When all hosts are marked IGNORED by the load-balancing policy (e.g. WhiteListRoundRobinPolicy with a NAT address not known to the cluster), no connection pools are created. Instead of raising NoHostAvailable on Session.connect(), the driver now logs a warning and falls back to executing queries on the already-established control connection.
  • Adds ResponseFuture._query_control_connection() method that borrows the control connection directly when session._pools is empty.
  • Adds integration test reproducing the exact scenario from the issue (connect via unadvertised NAT proxy with whitelist policy).

Fixes: #720
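The fallback described above can be sketched as follows. This is a minimal model, not the driver's actual code: the names `Session._pools`, `ControlConnection`, and the tuple return value are assumptions made for illustration.

```python
import logging

log = logging.getLogger(__name__)

class ControlConnection:
    """Stand-in for the driver's already-established control connection."""
    def execute(self, query):
        return ("control-connection", query)

class Session:
    def __init__(self, pools, control_connection):
        self._pools = pools  # host -> connection pool mapping
        self._control_connection = control_connection

    def execute(self, query):
        if not self._pools:
            # All hosts were marked IGNORED by the load-balancing policy,
            # so no pools exist. Instead of raising NoHostAvailable, log a
            # warning and borrow the control connection for the query.
            log.warning("No host pools available; "
                        "falling back to the control connection")
            return self._control_connection.execute(query)
        raise NotImplementedError("normal pool-based execution path")
```

In the PR itself this logic lives in `ResponseFuture._query_control_connection()`; the sketch above only shows the decision point.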

Test plan

  • Unit tests pass (pytest tests/unit/)
  • Integration test: SCYLLA_VERSION="release:2025.2" uv run pytest tests/integration/standard/test_public_address.py -s

dkropachev force-pushed the fix/control-connection-fallback-720 branch from 498b10c to 39ae69c on February 23, 2026 at 17:37
When all hosts are marked IGNORED by the load-balancing policy (e.g.
WhiteListRoundRobinPolicy with a NAT address not known to the cluster),
no connection pools are created. Instead of raising NoHostAvailable on
Session.connect(), log a warning and fall back to executing queries on
the already-established control connection.

Fixes: scylladb#720
dkropachev force-pushed the fix/control-connection-fallback-720 branch from 39ae69c to 5a24812 on February 23, 2026 at 20:06
@sylwiaszunejko (Collaborator)

@dkropachev I may be misunderstanding your approach, so I’d appreciate some clarification.

From what I see, the bug described in the issue started occurring in version 3.29.8, while everything worked correctly in 3.29.7. Because of that, I’m not sure how this PR would be reverting a newly introduced bug.

WhiteListRoundRobinPolicy with a NAT address not known to the cluster

This seems to be a wrong configuration. In such a case, I’m not sure whether introducing a fallback is the right solution.

In the issue, the node address was properly recognized before the version upgrade, so I don't think we need a fallback; rather, we need to find out where the introduced bug is. I will try to investigate this; I have asked for more information (like logs) in the issue.

@Lorak-mmk

From what I see, the bug described in the issue started occurring in version 3.29.8, while everything worked correctly in 3.29.7. Because of that, I’m not sure how this PR would be reverting a newly introduced bug.
This seems to be a wrong configuration. In such a case, I’m not sure whether introducing a fallback is the right solution.
In the issue the node address was properly recognized before the version upgrade, so I don't think we need fallback, but rather we need to find out where the introduced bug is.

+1, fallback to CC is definitely not the right solution to anything.

@dkropachev (Collaborator, Author) commented Feb 24, 2026

What happens is the following:

  1. The user has a TCP proxy in front of a node.
  2. The proxy's address is present neither in broadcast_rpc_address nor in rpc_address.
  3. The user opens a driver session targeting that TCP proxy.
  4. The driver fails to open a connection for any node pool, because the information it pulls from system.local and system.peers points to addresses that are unreachable.
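A toy model of how this scenario leaves the driver with no pools at all. The class and constant names mirror the driver's concepts (`WhiteListRoundRobinPolicy`, `HostDistance.IGNORED`) but this is an illustrative sketch, and all addresses are made up.

```python
from enum import Enum

class HostDistance(Enum):
    LOCAL = 0
    IGNORED = 2

class WhiteListPolicy:
    """Toy model of a whitelist policy's distance() decision."""
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def distance(self, address):
        # Any host not on the whitelist is IGNORED: no pool is created for it.
        return HostDistance.LOCAL if address in self.allowed else HostDistance.IGNORED

# The user whitelists the NAT proxy address ...
policy = WhiteListPolicy(["203.0.113.10"])
# ... but system.local / system.peers advertise only internal addresses:
advertised = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
distances = {h: policy.distance(h) for h in advertised}
# Every advertised host is IGNORED, so the driver creates no pools at all.
```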

What the driver was doing before #623 is the following:

  1. Connect to the cluster via the contact endpoint.
  2. Pull system.local and system.peers, merging them into a single list; while merging, copy over the endpoint of the node the control connection is currently connected to, preserving the way that node is reached.
  3. Because the endpoint was copied over, the driver was able to create one node pool for the node the CC was connected to.
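The pre-#623 merge can be sketched like this. The row shapes and the `endpoint` / `rpc_address` keys are assumptions for illustration, not the driver's internal representation.

```python
def merge_topology(cc_endpoint, local_row, peer_rows):
    """Merge system.local and system.peers into one host list, keeping the
    endpoint the driver actually used to reach the control-connection node."""
    local = dict(local_row)
    # Copy over the endpoint used to reach this node, not its (possibly
    # unreachable) advertised address.
    local["endpoint"] = cc_endpoint
    hosts = [local]
    for peer in peer_rows:
        entry = dict(peer)
        entry["endpoint"] = entry["rpc_address"]  # peers keep advertised address
        hosts.append(entry)
    return hosts

hosts = merge_topology(
    "203.0.113.10:9042",                 # NAT proxy the CC connected through
    {"rpc_address": "10.0.0.1"},         # system.local row
    [{"rpc_address": "10.0.0.2"}, {"rpc_address": "10.0.0.3"}],
)
```

With the copied-over endpoint, the control-connection node stays reachable and one pool can be built for it, even though every advertised address is unreachable.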

Now, please throw in ideas on how to fix this scenario.

Creating a node pool using the same endpoint is a bad idea because in some cases, like Private Link, the driver is pointed at a load balancer that lands the connection on a random node; so while the driver thinks it is connecting to the same node, it is actually reaching different nodes.

@Lorak-mmk
Copy link

Creating a node pool using the same endpoint is a bad idea because in some cases, like Private Link, the driver is pointed at a load balancer that lands the connection on a random node; so while the driver thinks it is connecting to the same node, it is actually reaching different nodes.

Do we need to consider Private Link here? When we use it, the driver will know about it and can behave differently.
Imo, creating the pool using the address that was used to create the CC (as was done in the previous version) makes sense; just don't do this for Private Link.
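The suggestion boils down to a conditional endpoint choice. A minimal sketch, where the function name and the `is_private_link` flag are hypothetical, not part of the driver's API:

```python
def pool_endpoint_for_cc_node(cc_endpoint, advertised_address, is_private_link):
    """Pick the endpoint for the control-connection node's pool."""
    # Behind Private Link the CC address is a load balancer that may land on
    # a random node, so the advertised address must be used instead; in the
    # plain NAT/proxy case, reuse the endpoint the CC already connected through.
    return advertised_address if is_private_link else cc_endpoint
```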

@dkropachev (Collaborator, Author) commented Feb 24, 2026

Creating a node pool using the same endpoint is a bad idea because in some cases, like Private Link, the driver is pointed at a load balancer that lands the connection on a random node; so while the driver thinks it is connecting to the same node, it is actually reaching different nodes.

Do we need to consider Private Link here? When we use it, the driver will know about it and can behave differently. Imo, creating the pool using the address that was used to create the CC (as was done in the previous version) makes sense; just don't do this for Private Link.

We need to consider that when the user points the driver at something that is not a legitimate cluster node or entry point to the cluster, it could be anything: a simple TCP proxy, a single-node load balancer, or a cluster-wide load balancer.

In any scenario where it is not a legitimate entry point, we need to make sure that the driver doesn't misbehave but can still execute some queries.
Private Link is only an example here.



Development

Successfully merging this pull request may close these issues.

SCT failing to connect via public address with scylla-driver==3.29.8

3 participants